The objective of this project was to compile data on the data science skills currently in demand. We did this by creating tables that combine skill names scraped from blog posts with counts from actual job postings. In doing so, we sought to answer three research questions.
Our first main step was creating a table that combines 1) skills in demand according to data science blogs and 2) the number of job postings listing each skill.
We used the rvest package with CSS node selectors to subset the skills of interest:
# Load rvest for web scraping (it also re-exports the %>% pipe)
library(rvest)

# Start by reading an HTML page with read_html():
skills_page <- read_html("https://brainstation.io/career-guides/data-science-resume-examples#what-skills-should-you-put-on-a-data-science-resume")

# Pull the list items out of the two skill lists on the page
skills_set1 <- skills_page %>% html_nodes("ul:nth-child(4)") %>% html_elements("li") %>% html_text2()
skills_set2 <- skills_page %>% html_nodes("ul:nth-child(6)") %>% html_elements("li") %>% html_text2()
From here, we can preview the vector:
writeLines(skills_set1)
## Data Courses
## Data Analytics
## Data Science
## Python
## Cybersecurity
## Data analysis
## Data wrangling
## Data modeling
## Statistics
## Data visualization
## Programming
## Quantitative analysis
## Machine learning
## Machine learning models
## Data mining
## Debugging
## Hypothesis testing
## A/B tests
## Regression
## Research the company, the role, and relevant data skills
## Reference resume templates and samples to build a resume outline
## Add relevant education experience, work experience and data projects to the correct section of your resume
## Highlight experience with machine learning and data tools
## Craft concise bullet points using the action verb + task + result format for each experience, emphasizing data-driven successes
## Have a trusted peer proofread your Data Scientist resume for grammar and spelling to make sure your experience is professionally presented
## Work on projects in a collaborative setting
## Take advantage of our flexible plans and scholarships
## Get access to VIP events and workshops
The next step was subsetting the relevant entries from skills_set1 and skills_set2 (dropping the page's surrounding navigation text) and trimming extra whitespace.
# Keep only the skill entries and trim surrounding whitespace
vector1 <- trimws(skills_set1[6:19], which = "both")
vector2 <- trimws(skills_set2[3:34], which = "both")
Here are the cleaned vectors:
vector1 %>% writeLines()
## Data analysis
## Data wrangling
## Data modeling
## Statistics
## Data visualization
## Programming
## Quantitative analysis
## Machine learning
## Machine learning models
## Data mining
## Debugging
## Hypothesis testing
## A/B tests
## Regression
vector2 %>% writeLines()
## R
## Python
## C
## C++
## C#
## HTML
## Java
## JavaScript
## PHP
## SAS
## SQL
## Scala
## MATLAB
## SQL Server
## NoSQL
## Hadoop
## OpenRefine
## TensorFlow
## Cloudera
## Tableau
## Microsoft Excel
## Octave
## Spark
## PowerBI
## Plotly
## Bokeh
## Matplotlib
## Seaborn
## Keras
## Pytorch
## AWS
## Hive
Soft Skills: As before, we start by reading an HTML page with read_html():
softskills_page <- read_html("https://zety.com/blog/data-scientist-resume-example")
From here, we can scrape the skills using a node selector:
skills_set3<- softskills_page %>% html_nodes("td:nth-child(1) > p > a") %>% html_text2()
skills_set3[3:9]
## [1] "Soft Skills" "Communication" "Collaboration"
## [4] "Creative Thinking" "Critical Thinking" "Problem Solving"
## [7] "Interpersonal Skills"
As before, we trim the entries into a vector of soft skills:
vector3 <- trimws(skills_set3[3:9], which = "both")
vector3
## [1] "Soft Skills" "Communication" "Collaboration"
## [4] "Creative Thinking" "Critical Thinking" "Problem Solving"
## [7] "Interpersonal Skills"
We can now combine the technical-skill vectors with the soft-skill vector.
title_vector <- c(vector1, vector2, vector3)
From there, we manually searched multiple job listing sites and recorded the number of results returned for each skill:
tbl1.siteskills = data.frame(skill_title = title_vector,
indeed = indeed_jobs,
linkedin = linkedin_jobs,
simplyhired = simplyhired_jobs,
ziprecruiter = ziprecruiter_jobs
)
head(tbl1.siteskills) %>% kable() %>% kableExtra::kable_styling(bootstrap_options = "basic")
| skill_title | indeed | linkedin | simplyhired | ziprecruiter |
|---|---|---|---|---|
| Data analysis | 46,915 | 580,822 | 27,586 | 1,685,359 |
| Data wrangling | 2,660 | 3,329 | 1,233 | 434 |
| Data modeling | 28,328 | 135,443 | 13,702 | 623,648 |
| Statistics | 2,327 | 216,875 | 44,456 | 288,970 |
| Data visualization | 41,727 | 119,686 | 37,409 | 170,268 |
| Programming | 9,444 | 510,656 | 87,187 | 2,047,037 |
We'll refer to this as Table 1 moving forward. We can export this table to the working directory:
# Get working directory path
path <- getwd()
path
## [1] "/Users/deepikadilip/git/Data_607_Project3"
write.csv(tbl1.siteskills, file.path(path, "skills_cleaned_data.csv"))
From here, we can join this table with existing data we've scraped:
tbl2.skillcat = read.csv('SkillCategories.csv')
head(tbl2.skillcat) %>% kable() %>% kableExtra::kable_styling(bootstrap_options = "basic")
| skill_title | SKILLID | skill_category |
|---|---|---|
| Data analysis | 1 | Fundamentals |
| Data wrangling | 2 | Fundamentals |
| Data modeling | 3 | Fundamentals |
| Statistics | 4 | Mathematics |
| Data visualization | 5 | Fundamentals |
| Programming | 6 | Fundamentals |
tbl3.skillrec = read.csv('RecomendedSkills.csv')
head(tbl3.skillrec) %>% kable() %>% kableExtra::kable_styling(bootstrap_options = "basic")
| Skill | RecommendationID | InJobSkills |
|---|---|---|
| C++ | 1 | C++ |
| Collaboration | 1 | Collaboration |
| Collaboration | 7 | Collaboration |
| Communication | 1 | Communication |
| Communication | 2 | Communication |
| Communication | 3 | Communication |
tbl4.skillsites = read.csv('RecommendationSites.csv')
head(tbl4.skillsites, 2) %>% kable() %>% kableExtra::kable_styling(bootstrap_options = "basic")
| RecommendationID | URL | Site | Title | Year |
|---|---|---|---|---|
| 1 | https://www.thebalancecareers.com/list-of-data-scientist-skills-2062381 | the balance careers | Important Job Skills for Data Scientist | 2021 |
| 2 | https://www.analyticsvidhya.com/blog/2020/11/14-must-have-skills-to-become-a-data-scientist-with-resources/ | Analytics Vidhya | 14 Must-Have Skills to Become a Data Scientist (with Resources!) | 2020 |
We can start the visualization by looking at the job-demand distribution in Table 1. To do this, we first reshape the data into long format for ggplot:
mlt.tbl1 = pivot_longer(tbl1.siteskills, cols = c("indeed", "linkedin", "simplyhired", "ziprecruiter"), values_to = "count")
# Strip the thousands separators so the counts are numeric
mlt.tbl1 = mlt.tbl1 %>% mutate(count = as.integer(gsub("\\,", "", count)))
# tbl1.siteskills has no SKILLID column, so join on the skill title alone
mlt.tbl1.merged = mlt.tbl1 %>% merge(tbl2.skillcat, by = "skill_title", all.x = TRUE)
mlt.tbl1.merged$name = str_to_sentence(mlt.tbl1.merged$name)
plt.sites = ggplot(mlt.tbl1.merged, aes(skill_title, count, fill = factor(skill_category))) +
  geom_bar(stat = "identity") +
  facet_wrap(~name, ncol = 1, scales = "free_y") +
  theme_minimal() +
  scale_fill_economist() +
  labs(x = "Skill", y = "Count", fill = "Skill Category") +
  theme(axis.text.x = element_text(size = 8, angle = 90, hjust = 1, vjust = 1), legend.position = "bottom")
ggplotly(plt.sites) #Makes it interactive
This plot is quite dense, so we can also aggregate the distribution into broader "buckets":
mlt.tbl1.merged$skill_bucket = ifelse(mlt.tbl1.merged$skill_category == "Languages", "Languages",
                               ifelse(mlt.tbl1.merged$skill_category == "Soft Skills", "Soft Skills", "Tech Data"))
plt.sites.aggregated = ggplot(mlt.tbl1.merged, aes(skill_title, count, fill = factor(skill_category))) +
  geom_bar(stat = "identity") +
  facet_wrap(~ skill_bucket + name, ncol = 2, scales = "free") +
  theme_minimal() +
  scale_fill_economist() +
  labs(x = "Skill", y = "Count", fill = "Skill Category") +
  theme(axis.text.x = element_text(size = 8, angle = 90, hjust = 1, vjust = 1), legend.position = "bottom")
ggplotly(plt.sites.aggregated, width = 800, height = 1500)
If we aggregate across job websites instead of stratifying by site, the overview appears as follows:
tbl1.per_bucket = mlt.tbl1.merged %>% group_by(skill_title, skill_bucket) %>% summarise(count = sum(count))
plt.bucket = ggplot(tbl1.per_bucket, aes(skill_title, count, fill = factor(skill_bucket))) +
  geom_bar(stat = "identity") +
  facet_wrap(~ skill_bucket, ncol = 1, scales = "free") +
  theme_minimal() +
  scale_fill_economist() +
  labs(x = "Skill", y = "Count", fill = "Skill Category") +
  theme(axis.text.x = element_text(size = 8, angle = 90, hjust = 1, vjust = 1), legend.position = "bottom")
ggplotly(plt.bucket, width = 800, height = 1500)
As an extra step, we can run some basic statistical tests to see whether particular skills are listed more often on one website than the others. We'll start by running a proportions test and identifying the skills that differ the most across websites:
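The prop.table object queried below is not defined in this document. A minimal sketch of how it might be built from the long-format counts in mlt.tbl1 (the function name and approach are assumptions, not the original code): for each skill, convert the four site counts into proportions and compute a chi-squared goodness-of-fit statistic, where larger values mean the skill's postings are spread more unevenly across the sites.

```r
library(tidyr)

# Sketch (hypothetical): per-skill site proportions plus a chi-squared
# goodness-of-fit statistic against an even split across the four sites.
build_prop_table <- function(long_counts) {
  wide <- as.data.frame(pivot_wider(long_counts, names_from = name, values_from = count))
  sites <- c("indeed", "linkedin", "simplyhired", "ziprecruiter")
  counts <- as.matrix(wide[, sites])
  wide[, sites] <- counts / rowSums(counts)  # share of each skill's postings per site
  wide$test.statistic <- apply(counts, 1, function(x) unname(chisq.test(x)$statistic))
  wide
}

# e.g. prop.table <- build_prop_table(mlt.tbl1)
```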
| skill_title | indeed | linkedin | simplyhired | ziprecruiter | test.statistic |
|---|---|---|---|---|---|
| Communication | 0.0060592 | 0.4471800 | 0.0040841 | 0.5426767 | 13330475 |
| Hypothesis testing | 0.0016622 | 0.0015483 | 0.0010033 | 0.9957862 | 8818391 |
| Programming | 0.0035580 | 0.1923865 | 0.0328472 | 0.7712084 | 5420010 |
| Data analysis | 0.0200433 | 0.2481422 | 0.0117855 | 0.7200290 | 4126554 |
| Problem Solving | 0.3125545 | 0.5271276 | 0.1508640 | 0.0094539 | 2876377 |
We'll do the same for those that differ the least among websites:
prop.table %>% arrange(desc(test.statistic)) %>% tail(5) %>% kable() %>% kable_styling(bootstrap_options = "basic")
| skill_title | indeed | linkedin | simplyhired | ziprecruiter | test.statistic |
|---|---|---|---|---|---|
| Octave | 0.1481013 | 0.7936709 | 0.0417722 | 0.0164557 | 1701.61 |
| Plotly | 0.1740716 | 0.2115385 | 0.1143899 | 0.5000000 | 1417.67 |
| Seaborn | 0.2858482 | 0.3514527 | 0.2052484 | 0.1574508 | 126.02 |
| Bokeh | 0.2821369 | 0.3005008 | 0.1602671 | 0.2570952 | 37.33 |
| OpenRefine | 0.3557692 | 0.1826923 | 0.3269231 | 0.1346154 | 19.38 |
Seaborn and Plotly had similar distributions across the job sites, but Matplotlib appeared more often on ZipRecruiter.
Next, to categorize the skills from the recommendation websites, we need to join Table 3 and Table 2 on the common skill name. Some skill names are capitalized inconsistently and would otherwise be counted separately, so we converted the skill-name columns in both tables to lower case to avoid this issue.
#Change the skills column to all lower case
tbl3.skillrec$Skill <- tolower(tbl3.skillrec$Skill)
tbl2.skillcat$skill_title <- tolower(tbl2.skillcat$skill_title)
# Join on common skill names; an inner join keeps only skills present in both tables
skillcat_rec <- tbl3.skillrec %>% inner_join(tbl2.skillcat, by = c("Skill" = "skill_title"))
rec <- skillcat_rec %>% group_by(RecommendationID) %>% count(skill_category)
head(rec) %>% kable() %>% kableExtra::kable_styling(bootstrap_options = "basic")
| RecommendationID | skill_category | n |
|---|---|---|
| 1 | Database Management | 4 |
| 1 | Fundamentals | 7 |
| 1 | Languages | 7 |
| 1 | Mathematics | 2 |
| 1 | Soft Skills | 4 |
| 1 | Software | 1 |
Next, we graph the distribution of skill categories among the 12 chosen websites.
# WIP: plan to make this interactive as well; the colors are hard to distinguish
ggplot(rec, aes(skill_category, n, fill = skill_category)) +
  geom_col() +
  facet_grid(rows = vars(RecommendationID), scales = "free_y", switch = "y", space = "free_y") +
  coord_flip() +
  labs(x = "RecommendationID", y = "Count") +
  theme(axis.text.x = element_text(size = 8, hjust = 1, vjust = 1),
        axis.text.y = element_blank(), axis.ticks.y = element_blank())
(WIP: use this to compare recommended skill categories vs. job postings, question #2. Fundamentals seems to be a more popular category here than in the job-site postings. Needs annotations.)
(We might get a more detailed look at which skills fall into the most popular categories to support further analysis. We started on this to see if it is worth investigating.)
skillcat_rec %>% filter(RecommendationID == 1 | RecommendationID == 9, skill_category == "Fundamentals")
## Skill RecommendationID InJobSkills SKILLID
## 1 data analysis 1 Data analysis 1
## 2 data mining 1 Data mining 10
## 3 data modeling 1 Data modeling 3
## 4 data visualization 1 Data visualization 5
## 5 data visualization 9 Data visualization 5
## 6 data wrangling 1 Data wrangling 2
## 7 data wrangling 1 Data wrangling 2
## 8 data wrangling 9 Data wrangling 2
## 9 machine learning 1 Machine learning 8
## 10 machine learning 9 Machine learning 8
## skill_category
## 1 Fundamentals
## 2 Fundamentals
## 3 Fundamentals
## 4 Fundamentals
## 5 Fundamentals
## 6 Fundamentals
## 7 Fundamentals
## 8 Fundamentals
## 9 Fundamentals
## 10 Fundamentals
# Data wrangling appears twice for RecommendationID 1, is this an error?
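If the repeated row is unintended, one way to drop it is dplyr's distinct(), which keeps the first occurrence of each unique row (a sketch, assuming the duplicate is an exact copy):

```r
library(dplyr)

# Sketch: drop exact duplicate rows, e.g. the repeated "data wrangling"
# entry for RecommendationID 1.
skillcat_rec_dedup <- distinct(skillcat_rec)
```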
# WIP: need to add color, maybe a different graphical output type; trying to see common skills between articles
# Fundamentals
skillcat_rec %>%
  filter((RecommendationID == 1 | RecommendationID == 9) & skill_category == "Fundamentals") %>%
  ggplot(aes(factor(RecommendationID), Skill)) +
  geom_count() +
  facet_grid(cols = vars(RecommendationID)) +
  scale_x_discrete(breaks = c(1, 9))
# WIP: categorize skills into the three general buckets to potentially answer question #3
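A sketch of that categorization, reusing the three-bucket mapping (Languages / Soft Skills / Tech Data) applied to the job-site data earlier; the function name is an assumption:

```r
library(dplyr)

# Sketch: map each skill_category onto the same three buckets used for the
# job-site data, so the two sources can be compared on equal terms.
bucket_skills <- function(df) {
  df %>%
    mutate(skill_bucket = case_when(
      skill_category == "Languages"   ~ "Languages",
      skill_category == "Soft Skills" ~ "Soft Skills",
      TRUE                            ~ "Tech Data"
    ))
}

# e.g. skillcat_rec <- bucket_skills(skillcat_rec)
```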